The populations of the United States and Canada are increasingly mobile. People relocate for various reasons; for example, they move from one metropolitan area to another when they change jobs. However, every metropolitan area has its own economic, social, and cultural characteristics, and these are reflected in its neighborhoods and communities. For people planning to relocate, knowing how the neighborhoods of the destination city resemble or differ from those of their current city is valuable. For example, it may help them find a suitable place to live in the new city.
There are, of course, many aspects we could use to describe and compare city neighborhoods, such as population, housing prices, traffic, and crime rates. For this project, we focus on neighborhood points of interest (POIs), or venues.
In this project, we will analyze neighborhood venues for Toronto and New York City (NYC) and attempt to answer two related questions:
Neighborhood data for NYC will be acquired from this source https://geo.nyu.edu/catalog/nyu_2451_34572. We are interested in the following data elements for each neighborhood contained in the file:
Neighborhood data for Toronto will be acquired from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. Because this source doesn't provide latitude and longitude coordinates, the Geocoder Python package https://geocoder.readthedocs.io/index.html will be used to produce the coordinates for each Toronto neighborhood.
We will also use the Foursquare API to explore neighborhoods in NYC and Toronto. Specifically, the Foursquare API provides an explore function that returns the venues and venue categories near each neighborhood. These venues will be used as features to group the neighborhoods into clusters.
The NYC neighborhood data acquired from the source is fairly clean and doesn't require any wrangling. However, for the Toronto neighborhood data scraped from the Wikipedia page, we had to apply the following transformations:
NYC has 306 neighborhoods in 5 boroughs, and Toronto has 103 neighborhoods in 10 boroughs.
The Foursquare API is used to obtain the venues near each neighborhood and their venue categories. For each neighborhood, we use a search radius of 400 meters and a limit of 100 venues.
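As a rough illustration of the request described above, the sketch below only builds the query URL and makes no network call. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `ll`, `radius`, `limit`, and `v` parameters; the helper name `build_explore_url`, the version date, the coordinates, and the credential placeholders are all illustrative and not taken from the project.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=400, limit=100):
    """Build the explore URL for one neighborhood (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (assumed value)
        "ll": f"{lat},{lng}",    # neighborhood latitude,longitude
        "radius": radius,        # search radius in meters
        "limit": limit,          # maximum number of venues returned
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder credentials and made-up coordinates, for illustration only.
url = build_explore_url(40.7414, -74.0033, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the venue names and categories read from the JSON response.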
There are 415 unique venue categories and 7717 venues for NYC neighborhoods, and 245 unique venue categories and 1637 venues for Toronto neighborhoods.
We now have the neighborhoods of NYC and Toronto and their nearby venues in separate dataframes. Before we can compare them, we need to combine their neighborhood venue data into a master dataframe. Since the same neighborhood name can appear in both cities, and we want to know which city a particular neighborhood belongs to, we prefix each neighborhood name with its city name when creating the master dataframe so that the names are unique.
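The combination step can be sketched as follows. To stay dependency-free, this sketch uses plain Python tuples rather than dataframes; the function name `combine_city_venues` and the sample rows are hypothetical and not real project data.

```python
def combine_city_venues(venues_by_city):
    """Merge per-city venue rows, prefixing each neighborhood with its city.

    venues_by_city maps a city name to rows of
    (neighborhood, venue_name, venue_category).
    """
    combined = []
    for city, rows in venues_by_city.items():
        for neighborhood, venue, category in rows:
            combined.append((f"{city} {neighborhood}", venue, category))
    return combined


# Made-up rows: the same neighborhood name appears in both cities.
venues_by_city = {
    "NYC": [("Chinatown", "Venue A", "Chinese Restaurant")],
    "Toronto": [("Chinatown", "Venue B", "Coffee Shop"),
                ("Weston", "Venue C", "Park")],
}
combined = combine_city_venues(venues_by_city)
names = sorted({row[0] for row in combined})
print(names)
```

After prefixing, the two identically named neighborhoods become separate keys, so later grouping steps cannot mix venues from different cities.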
Since we will group neighborhoods by the types of nearby venues (i.e., venue categories) using a clustering algorithm, we need to one-hot encode the venue categories.
We then group the one-hot encoded venues dataframe by neighborhood and take the mean of each venue category column, which gives the frequency of occurrence of each category in that neighborhood. This is the input data for the clustering algorithm.
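A minimal sketch of these two steps is shown below. Taking the mean of one-hot rows per neighborhood is the same as dividing each category count by the number of venues in that neighborhood, which is what the hypothetical `category_frequencies` function computes. It uses plain Python instead of dataframes, and the sample rows are made up.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Return (categories, freqs), where freqs maps each neighborhood to a
    vector of category frequencies (the mean of its one-hot encoded venues)."""
    categories = sorted({category for _, _, category in rows})
    counts = defaultdict(Counter)
    for neighborhood, _, category in rows:
        counts[neighborhood][category] += 1
    freqs = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        freqs[neighborhood] = [counter[c] / total for c in categories]
    return categories, freqs


# Made-up rows of (neighborhood, venue_name, venue_category).
rows = [
    ("NYC Example", "Venue A", "Park"),
    ("NYC Example", "Venue B", "Cafe"),
    ("Toronto Example", "Venue C", "Park"),
]
categories, freqs = category_frequencies(rows)
print(categories)
print(freqs)
```

Each neighborhood ends up with one row whose entries sum to 1, so neighborhoods with very different venue counts remain comparable.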
We chose the K-Means clustering algorithm for these reasons:
We chose K = 3 after experimenting with other values of K. An alternative would be to use the elbow method to determine the best value of K.
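To make the elbow method concrete, here is a small, dependency-free sketch of K-Means (Lloyd's algorithm) that also reports the inertia, i.e. the within-cluster sum of squared distances. It is not the project's implementation; the function names, the random initialization, and the toy points are assumptions chosen for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


# Toy data: three well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [20, 1], [21, 0]]

inertias = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

With the elbow method, one would plot inertia against K and pick the value after which further increases in K bring only small reductions.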
One future improvement would be to use K-Means++, which may help in picking better initial cluster centroids.
All neighborhoods in NYC and Toronto are grouped into three clusters.
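For reference, the K-Means++ seeding idea can be sketched as below: the first centroid is picked uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. The function name and toy points are hypothetical, and the sketch assumes the data contains at least k distinct points. In scikit-learn, the `KMeans` class exposes this seeding through `init="k-means++"`.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_init(points, k, seed=0):
    """Choose k initial centroids using K-Means++ seeding."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]  # first centroid: uniform choice
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to that distance,
        # so already chosen points (weight 0) are never picked again.
        next_point = rng.choices(points, weights=weights, k=1)[0]
        centroids.append(list(next_point))
    return centroids


# Toy data: three well-separated groups of 2-D points.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
initial_centroids = kmeans_plus_plus_init(points, 3)
print(initial_centroids)
```

Because distant points are more likely to be selected, the initial centroids tend to be spread across the data, which usually reduces the chance of a poor local optimum.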
This is the NYC neighborhoods map showing all clusters:
This is the Toronto neighborhoods map showing all clusters:
Recall that one of the goals of this project is to discover how NYC and Toronto neighborhoods are similar or dissimilar in terms of nearby venues. So, let us examine each neighborhood cluster.
A couple of interesting observations:
A large number of both NYC and Toronto neighborhoods are in this cluster. Many of them have various types of restaurants, coffee shops, and bakeries as their most common venues.
For example, all neighborhoods listed below have Chinese Restaurant as their 1st most common venue. One result is surprising, however: only one of them is a Toronto neighborhood, and the rest are all in NYC.
Some observations for this neighborhood cluster:
Recall that another goal of this project is, given a neighborhood in the origin city, to determine the similar neighborhoods in the destination city centered around a business location, for example, the Google NYC office.
Let us demonstrate this using Toronto Weston as the origin city and neighborhood. We want to find all similar neighborhoods in NYC, display them on a map and also show the location of Google NYC office.
These are the NYC neighborhoods that are similar to the Toronto Weston neighborhood:
So, if you currently live in the Toronto Weston neighborhood and you are considering a job offer to move to NYC to work for Google, these NYC neighborhoods would be similar in terms of enjoying nearby parks, according to our neighborhood data analysis.
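One way this lookup could work is sketched below: keep the destination-city neighborhoods that share the origin neighborhood's cluster label, then order them by great-circle distance to the business location. This is an assumption about the approach, not the project's code. The function names, cluster labels, placeholder neighborhood names, and all coordinates (including the approximate office location) are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lng points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def similar_neighborhoods(origin, destination_city, neighborhoods, office):
    """Destination neighborhoods in the origin's cluster, nearest to the office first."""
    origin_cluster = neighborhoods[origin]["cluster"]
    matches = [
        (haversine_km(info["lat"], info["lng"], office[0], office[1]), name)
        for name, info in neighborhoods.items()
        if info["city"] == destination_city and info["cluster"] == origin_cluster
    ]
    return [name for _, name in sorted(matches)]


# Made-up cluster labels and coordinates, for illustration only.
neighborhoods = {
    "Toronto Weston": {"city": "Toronto", "cluster": 2, "lat": 43.70, "lng": -79.52},
    "NYC Example A": {"city": "NYC", "cluster": 2, "lat": 40.85, "lng": -73.90},
    "NYC Example B": {"city": "NYC", "cluster": 2, "lat": 40.75, "lng": -74.00},
    "NYC Example C": {"city": "NYC", "cluster": 1, "lat": 40.74, "lng": -74.00},
}
office = (40.7414, -74.0033)  # approximate office location (assumed)

result = similar_neighborhoods("Toronto Weston", "NYC", neighborhoods, office)
print(result)
```

The sorted list can then be drawn on a map together with a marker for the office.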
In this project, we used the unsupervised K-Means clustering algorithm to develop a model that groups NYC and Toronto neighborhoods. The features used for grouping are the venues near each neighborhood and their categories. The Foursquare API is used to acquire each neighborhood's venues.
All neighborhoods in NYC and Toronto are grouped into three distinct clusters: Beach neighborhoods, Restaurant and Food neighborhoods, and Park neighborhoods. We made observations about the similarities and differences between NYC and Toronto. In addition, given a neighborhood in the origin city (such as Toronto Weston), this model can find similar neighborhoods in the destination city (NYC) centered around a business location, for example, the Google NYC office.